Script Identification in Natural Scene Image and Video Frame using Attention based Convolutional-LSTM Network

نویسندگان

Ankan Kumar Bhunia

Aishik Konwer

Abir Bhowmick

Ayan Kumar Bhunia

Partha Pratim Roy

Umapada Pal

چکیده

Script identification plays a significant role in analysing documents and videos. In this paper, we focus on the problem of script identification in scene text images and video scripts. Because of low image quality, complex background and similar layout of characters shared by some scripts like Greek, Latin, etc., text recognition in those cases become challenging. Most of the recent approaches generally use a patch-based CNN network with summation of obtained features, or only a CNN-LSTM network to get the identification result. Some use a discriminative CNN to jointly optimize mid-level representations and deep features. In this paper, we propose a novel method that involves extraction of local and global features using CNN-LSTM framework and weighting them dynamically for script identification. First, we convert the images into patches and feed them into a CNN-LSTM framework. Attention-based patch weights are calculated applying softmax layer after LSTM. Then we do patch-wise multiplication of these weights with corresponding CNN to yield local features. Global features are also extracted from last cell state of LSTM. We employ a fusion technique which dynamically weights the local and global features for an individual patch. Experiments have been done in two public script identification datasets, SIW-13 and CVSI2015. The proposed framework achieves superior results in comparison to conventional methods. Keywords-Script Identification, Convolutional Neural Network, Long Short-Term Memory, Local feature, Global feature, Attention Network, Dynamic Weighting. 1 1 # Both the authors contributed equally.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

Over the past few years, deep neural networks (DNNs) have exhibited great success in predicting the saliency of images. However, there are few works that apply DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient dat...

متن کامل

An efficient method for cloud detection based on the feature-level fusion of Landsat-8 OLI spectral bands in deep convolutional neural network

Cloud segmentation is a critical pre-processing step for any multi-spectral satellite image application. In particular, disaster-related applications e.g., flood monitoring or rapid damage mapping, which are highly time and data-critical, require methods that produce accurate cloud masks in a short time while being able to adapt to large variations in the target domain (induced by atmospheric c...

متن کامل

Attention Based CLDNNs for Short-Duration Acoustic Scene Classification

Recently, neural networks with deep architecture have been widely applied to acoustic scene classification. Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory Networks (LSTMs) have shown improvements over fully connected Deep Neural Networks (DNNs). Motivated by the fact that CNNs, LSTMs and DNNs are complimentary in their modeling capability, we apply the CLDNNs (Convolutiona...

متن کامل

Cystoscopy Image Classication Using Deep Convolutional Neural Networks

In the past three decades, the use of smart methods in medical diagnostic systems has attractedthe attention of many researchers. However, no smart activity has been provided in the eld ofmedical image processing for diagnosis of bladder cancer through cystoscopy images despite the highprevalence in the world. In this paper, two well-known convolutional neural networks (CNNs) ...

متن کامل

Decomposing Motion and Content for Natural Video Sequence Prediction

We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independent...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

CoRR

دوره abs/1801.00470 شماره

صفحات -

تاریخ انتشار 2018

Script Identification in Natural Scene Image and Video Frame using Attention based Convolutional-LSTM Network

نویسندگان

چکیده

منابع مشابه

Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

An efficient method for cloud detection based on the feature-level fusion of Landsat-8 OLI spectral bands in deep convolutional neural network

Attention Based CLDNNs for Short-Duration Acoustic Scene Classification

Cystoscopy Image Classication Using Deep Convolutional Neural Networks

Decomposing Motion and Content for Natural Video Sequence Prediction

عنوان ژورنال:

اشتراک گذاری